In [1]:
import numpy as np #linear algebra
import pandas as pd #data manipulation and analysis
import matplotlib.pyplot as plt #data visualization
import seaborn as sns #data visualization
import sklearn.preprocessing as skp #machine learning (preprocessing)
import sklearn.cluster as skc #machine learning (clustering)
import sklearn.metrics as metrics
from sklearn.metrics import silhouette_score
from scipy.spatial import ConvexHull, convex_hull_plot_2d
import matplotlib.path as mpath
from adjustText import adjust_text
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
import warnings # ignore warnings
warnings.filterwarnings('ignore')

path = "../../data/cleanData/housingData.csv"

Unsupervised ML Algorithm: K-Means Clustering¶

The K-Means clustering algorithm is a popular unsupervised machine learning algorithm used for clustering and segmentation of data into distinct groups based on similarity. It is a simple yet effective algorithm that divides a dataset into k number of clusters, each represented by a centroid point.

The algorithm works by iteratively minimizing the sum of squared distances between each data point and the centroid point of its assigned cluster. This process is achieved by assigning data points to their closest centroid and updating the centroid position based on the new cluster assignments.

The algorithm terminates when the centroids no longer move or after a specified number of iterations. Once the algorithm converges, each data point is assigned to its closest centroid, and the resulting clusters represent distinct groups in the data.

Predictions are based on:

  • The number of cluster centers present (k).
  • Nearest mean values, measured as the Euclidean distance between observations.
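The assign-and-update loop described above can be sketched with NumPy (a minimal toy illustration with made-up points and starting centroids, not the scikit-learn implementation used later in this notebook):

```python
import numpy as np

# Toy 2-D data: two obvious groups (made-up points for illustration)
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])  # k = 2 initial guesses

for _ in range(10):  # iterate until convergence (or a max-iteration cap)
    # Assignment step: each point joins the cluster of its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: move each centroid to the mean of its assigned points
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):  # centroids stopped moving
        break
    centroids = new_centroids

print(labels)     # [0 0 1 1]
print(centroids)  # final centroids at approximately (1.25, 1.5) and (8.5, 8.75)
```

This is the whole algorithm in miniature: the two steps repeat until the centroids stop moving.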

How we will be using K-Means to find counties at risk of implementing State Bill 52

K-Means clustering can identify counties with similar demographic profiles based on features such as age, gender, race, income, education level, and more. We gathered a dataset containing information about each county's demographic variables. We can then apply K-Means clustering to this dataset to group similar counties together based on their demographic similarities.

The first step is to decide on the number of clusters (k) we want to create. Then, we can use our domain knowledge from earlier this semester and statistical methods to determine the optimal number of clusters. Once we have decided on the number of clusters, we can apply the K-Means algorithm to the dataset, with each county represented as a data point in a high-dimensional feature space.

The algorithm will group the counties based on their demographic similarities by minimizing the sum of squared distances between the centroid of each cluster and the data points assigned to it. The resulting clusters will represent groups of counties that have similar demographic characteristics.

We can then analyze each cluster to understand the demographic characteristics of the counties in it and use this information to identify patterns and trends. For example, we may find that counties in one cluster have a higher percentage of elderly residents with a lower income and education level, while counties in another cluster have a higher percentage of younger residents with a higher income and education level.

Housing Data Cluster Analysis¶

We will be reading in our subset of the Census data that contains only the demographic variables related to housing units and housing costs, along with other variables that may help our analysis.

In [2]:
df = pd.read_csv(path, index_col=0)
df.head()
Out[2]:
County Name Housing units, July 1, 2021, (V2021) Owner-occupied housing unit rate, 2017-2021 Median value of owner-occupied housing units, 2017-2021 Median selected monthly owner costs -with a mortgage, 2017-2021 Median selected monthly owner costs -without a mortgage, 2017-2021 Median gross rent, 2017-2021 Building permits, 2021 Households, 2017-2021 Persons per household, 2017-2021 Living in same house 1 year ago of persons age 1 year+, 2017-2021 Language other than English spoken at home of persons age 5 years+, 2017-2021 Households with a computer, 2017-2021 Households with a broadband Internet subscription, 2017-2021 Banned or not
0 Adams County, Ohio 12703 73.2 118300 1159 403 593 25.0 10163 2.68 24760.26 495.76 23713.66 21014.55 0.0
1 Allen County, Ohio 44707 66.9 126900 1126 457 758 226.0 40671 2.42 88046.22 3151.77 92113.02 86724.51 1.0
2 Ashland County, Ohio 22513 76.5 143100 1124 432 748 61.0 20531 2.46 45148.71 3191.28 46299.66 43317.65 0.0
3 Ashtabula County, Ohio 46355 71.6 121100 1086 426 735 168.0 38332 2.47 84877.86 6521.58 85948.57 80011.01 0.0
4 Athens County, Ohio 26387 59.8 150800 1210 464 840 110.0 22381 2.41 43439.20 3226.91 56222.74 50079.19 0.0

Scaling¶

Scaling is essential when using the K-Means clustering algorithm because K-Means relies on Euclidean distance (the straight-line distance between two points), which is sensitive to the scale of the variables. If the variables are not scaled appropriately, variables with larger values will influence the distance calculations far more than those with smaller values.

Furthermore, scaling the variables allows for a more meaningful interpretation of the clustering results. Without scaling, variables measured in different units are difficult to compare, making it hard to interpret their relative importance in determining the clustering structure.
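A quick toy example (hypothetical numbers, loosely shaped like our housing columns) shows how an unscaled feature can dominate the Euclidean distance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on wildly different scales (hypothetical county-like values):
# median home value (~100,000s) and owner-occupancy rate (~tens of percent)
X = np.array([[118300.0, 73.2],
              [126900.0, 66.9],
              [143100.0, 76.5]])

# Unscaled: the distance between the first two rows is dominated almost
# entirely by home value; the 6.3-point occupancy-rate gap barely registers
d_raw = np.linalg.norm(X[0] - X[1])
print(round(d_raw, 2))  # 8600.0

# After standardization, both features contribute on a comparable footing
Z = StandardScaler().fit_transform(X)
d_scaled = np.linalg.norm(Z[0] - Z[1])
print(round(d_scaled, 2))  # ~1.79, with both columns contributing
```

In the unscaled version the occupancy rate is effectively invisible to K-Means; after standardization it carries real weight in the distance.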

In [3]:
# First we start off removing 'County Name' and 'Banned or not' columns before scaling
cluster_df = df.iloc[:, 1:-1]
In [4]:
# Scaling the new data frame for clustering
sc = skp.StandardScaler()
cluster_scale = np.array(cluster_df)
scaled = sc.fit_transform(cluster_scale.astype(float))
scaled_cluster = pd.DataFrame(scaled, columns=cluster_df.columns)
scaled_cluster.head()
Out[4]:
Housing units, July 1, 2021, (V2021) Owner-occupied housing unit rate, 2017-2021 Median value of owner-occupied housing units, 2017-2021 Median selected monthly owner costs -with a mortgage, 2017-2021 Median selected monthly owner costs -without a mortgage, 2017-2021 Median gross rent, 2017-2021 Building permits, 2021 Households, 2017-2021 Persons per household, 2017-2021 Living in same house 1 year ago of persons age 1 year+, 2017-2021 Language other than English spoken at home of persons age 5 years+, 2017-2021 Households with a computer, 2017-2021 Households with a broadband Internet subscription, 2017-2021
0 -0.461465 0.057532 -0.723934 -0.291559 -0.895488 -1.624353 -0.397810 -0.472996 1.410608 -0.490601 -0.353976 -0.484075 -0.487935
1 -0.148431 -1.020021 -0.511531 -0.453171 -0.177526 -0.181102 -0.148449 -0.144003 -0.437886 -0.149525 -0.253477 -0.150617 -0.148734
2 -0.365513 0.621964 -0.111422 -0.462966 -0.509916 -0.268572 -0.353149 -0.361189 -0.153503 -0.380719 -0.251982 -0.373964 -0.372805
3 -0.132311 -0.216132 -0.654780 -0.649065 -0.589690 -0.382283 -0.220404 -0.169226 -0.082407 -0.166601 -0.125969 -0.180670 -0.183390
4 -0.327621 -2.234405 0.078753 -0.041794 -0.084457 0.536150 -0.292359 -0.341239 -0.508982 -0.389932 -0.250634 -0.325588 -0.337901

Methods Used To Find the Best Number of Clusters¶

Elbow Plot¶

The elbow plot is a graphical method to determine the optimal number of clusters in a k-means clustering algorithm. It plots the within-cluster sum of squares (WSS) against the number of clusters. The WSS score is a way to quantify how well a clustering algorithm can group similar data points. It tells us how far each point is from its assigned group center and how similar the points within each group are. A lower WSS score means that the points within each group are similar and that the groups are more compact and well-defined.
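As a sanity check on what the WSS measures, it can be recomputed by hand on toy (made-up) data and compared against the `inertia_` attribute reported by scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two tight pairs of points, so k = 2 gives a very low WSS
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# WSS by hand: sum of squared distances from each point to its own centroid
wss = sum(np.sum((x - km.cluster_centers_[lbl]) ** 2)
          for x, lbl in zip(X, km.labels_))

print(round(wss, 2))                  # 1.0 here (4 points x 0.25 each)
print(np.isclose(wss, km.inertia_))   # True: matches scikit-learn's inertia
```

The `wss` list built in the next cell is exactly this quantity, one value per candidate k.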

In [5]:
# Decide n_clusters using the elbow method
wss = []
k_range = range(1,12)
for i in k_range:
    kmeans = skc.KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(scaled_cluster)
    wss.append(kmeans.inertia_)
fig, ax = plt.subplots(figsize=(8, 6), dpi=80)
plt.plot(k_range, wss, marker='o')
for i, value in enumerate(wss):
    ax.text(i+1.05, value-0.005, round(value,1), fontsize=12, fontweight='bold')
    
plt.xticks(k_range)
# plt.grid()
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WSS')
# plt.savefig('elbow_methodA.png')
plt.show()

The elbow plot shows a decreasing trend in the WSS score as the number of clusters increases. The critical point to look for is the "elbow" point, where the rate of decrease in the WSS score starts to level off; it gets its name because the curve resembles an arm bending at the elbow. The optimal number of clusters is typically the value at the elbow point. If the elbow point is ambiguous, it is recommended to choose a value that balances a low WSS score against not having too many clusters, to prevent overfitting.

As we can see, the "elbow" point in the graph suggests that the optimal number of clusters would be around 2 or 3.

Silhouette Method¶

The silhouette method is used to evaluate the quality of clustering results. It measures how well each data point in a cluster is separated from data points in other clusters. The silhouette score ranges from -1 to 1, where a score closer to 1 indicates that the data point is well matched to its own cluster and poorly matched to neighboring clusters. If the score drops as the number of clusters increases, the clustering is becoming weaker and less effective.

In [6]:
# Define the range of k values to try
k_range = range(2, 11)

# Define an empty list to store the silhouette scores for each k value
silhouette_scores = []

# Loop over the range of k values
for k in k_range:
    # Fit a KMeans model with the current k value
    kmeans = skc.KMeans(n_clusters=k, random_state=42)
    kmeans.fit(scaled_cluster)
    
    # Calculate the silhouette score for the current clustering
    silhouette_avg = silhouette_score(scaled_cluster, kmeans.labels_)
    
    # Append the silhouette score to the list of scores
    silhouette_scores.append(silhouette_avg)
    
    # Print the current k value and silhouette score
    print(f"k = {k}, silhouette score = {silhouette_avg:.3f}")
k = 2, silhouette score = 0.739
k = 3, silhouette score = 0.483
k = 4, silhouette score = 0.462
k = 5, silhouette score = 0.409
k = 6, silhouette score = 0.246
k = 7, silhouette score = 0.211
k = 8, silhouette score = 0.230
k = 9, silhouette score = 0.250
k = 10, silhouette score = 0.247

In the output provided, we have the silhouette scores for a range of values of k, where k represents the number of clusters:

  • A score of 0.739 for k=2 indicates that the data points are well separated into two clusters, and the clustering is strong.
  • A score of 0.483 for k=3 suggests that the data points are somewhat separated into three clusters, but the separation is weaker than for k=2.

The optimal value of k is likely to be two, as it has the highest silhouette score. However, it is essential to consider other factors, such as the clusters' interpretability and usefulness for the task at hand, when choosing the number of clusters for a specific application.

Clustering (Using chosen K values)¶

Although the silhouette score for k=3 is lower than for k=2, k=3 is the better choice for this analysis: counties implementing the SB 52 Bill have either a full or partial ban, which, together with counties that have no ban, suggests the data should be grouped into three distinct clusters.

In [7]:
# Clustering K Means, K=3
kmeans_3 = skc.KMeans(n_clusters=3,random_state=42)
kmeans_3.fit(scaled_cluster)

kmeans_3.labels_
Out[7]:
array([0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0,
       2, 0, 1, 0, 0, 2, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0,
       2, 0, 2, 2, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 2, 0,
       2, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 2, 2, 0, 2, 0],
      dtype=int32)
In [8]:
# Assign the clustering result to each county in the data frame
cluster_df['cluster_id'] = kmeans_3.labels_
cluster_df.head()
Out[8]:
Housing units, July 1, 2021, (V2021) Owner-occupied housing unit rate, 2017-2021 Median value of owner-occupied housing units, 2017-2021 Median selected monthly owner costs -with a mortgage, 2017-2021 Median selected monthly owner costs -without a mortgage, 2017-2021 Median gross rent, 2017-2021 Building permits, 2021 Households, 2017-2021 Persons per household, 2017-2021 Living in same house 1 year ago of persons age 1 year+, 2017-2021 Language other than English spoken at home of persons age 5 years+, 2017-2021 Households with a computer, 2017-2021 Households with a broadband Internet subscription, 2017-2021 cluster_id
0 12703 73.2 118300 1159 403 593 25.0 10163 2.68 24760.26 495.76 23713.66 21014.55 0
1 44707 66.9 126900 1126 457 758 226.0 40671 2.42 88046.22 3151.77 92113.02 86724.51 0
2 22513 76.5 143100 1124 432 748 61.0 20531 2.46 45148.71 3191.28 46299.66 43317.65 0
3 46355 71.6 121100 1086 426 735 168.0 38332 2.47 84877.86 6521.58 85948.57 80011.01 0
4 26387 59.8 150800 1210 464 840 110.0 22381 2.41 43439.20 3226.91 56222.74 50079.19 0
In [9]:
# Adding the 'County Name' and 'Banned or not' columns back
cluster_df['Banned or not'] = df.iloc[:,-1]
cluster_df['County Name'] = df.iloc[:,0]
cluster_df.head()
Out[9]:
Housing units, July 1, 2021, (V2021) Owner-occupied housing unit rate, 2017-2021 Median value of owner-occupied housing units, 2017-2021 Median selected monthly owner costs -with a mortgage, 2017-2021 Median selected monthly owner costs -without a mortgage, 2017-2021 Median gross rent, 2017-2021 Building permits, 2021 Households, 2017-2021 Persons per household, 2017-2021 Living in same house 1 year ago of persons age 1 year+, 2017-2021 Language other than English spoken at home of persons age 5 years+, 2017-2021 Households with a computer, 2017-2021 Households with a broadband Internet subscription, 2017-2021 cluster_id Banned or not County Name
0 12703 73.2 118300 1159 403 593 25.0 10163 2.68 24760.26 495.76 23713.66 21014.55 0 0.0 Adams County, Ohio
1 44707 66.9 126900 1126 457 758 226.0 40671 2.42 88046.22 3151.77 92113.02 86724.51 0 1.0 Allen County, Ohio
2 22513 76.5 143100 1124 432 748 61.0 20531 2.46 45148.71 3191.28 46299.66 43317.65 0 0.0 Ashland County, Ohio
3 46355 71.6 121100 1086 426 735 168.0 38332 2.47 84877.86 6521.58 85948.57 80011.01 0 0.0 Ashtabula County, Ohio
4 26387 59.8 150800 1210 464 840 110.0 22381 2.41 43439.20 3226.91 56222.74 50079.19 0 0.0 Athens County, Ohio
In [10]:
# Save data as: housingClusterData.csv; this will be used for our further analysis.
cluster_df.to_csv('../viz/housingClusterData.csv')

Visualizing the Clusters¶

In [11]:
# Replacing 'Banned or not' column values with 0 = 'No' and 1 = 'Yes'
cluster_df['banned'] = cluster_df['Banned or not'].replace({0: 'No', 1: 'Yes'})

# Converting 'cluster_id' column values from integer to string
cluster_df['cluster'] = cluster_df['cluster_id'].replace({0: '0', 1: '1', 2: '2'})
In [12]:
cluster_df.head()
Out[12]:
Housing units, July 1, 2021, (V2021) Owner-occupied housing unit rate, 2017-2021 Median value of owner-occupied housing units, 2017-2021 Median selected monthly owner costs -with a mortgage, 2017-2021 Median selected monthly owner costs -without a mortgage, 2017-2021 Median gross rent, 2017-2021 Building permits, 2021 Households, 2017-2021 Persons per household, 2017-2021 Living in same house 1 year ago of persons age 1 year+, 2017-2021 Language other than English spoken at home of persons age 5 years+, 2017-2021 Households with a computer, 2017-2021 Households with a broadband Internet subscription, 2017-2021 cluster_id Banned or not County Name banned cluster
0 12703 73.2 118300 1159 403 593 25.0 10163 2.68 24760.26 495.76 23713.66 21014.55 0 0.0 Adams County, Ohio No 0
1 44707 66.9 126900 1126 457 758 226.0 40671 2.42 88046.22 3151.77 92113.02 86724.51 0 1.0 Allen County, Ohio Yes 0
2 22513 76.5 143100 1124 432 748 61.0 20531 2.46 45148.71 3191.28 46299.66 43317.65 0 0.0 Ashland County, Ohio No 0
3 46355 71.6 121100 1086 426 735 168.0 38332 2.47 84877.86 6521.58 85948.57 80011.01 0 0.0 Ashtabula County, Ohio No 0
4 26387 59.8 150800 1210 464 840 110.0 22381 2.41 43439.20 3226.91 56222.74 50079.19 0 0.0 Athens County, Ohio No 0

Bar Chart: Frequency of Counties in Each Cluster¶

In [13]:
fig = px.histogram(cluster_df, x='cluster', color='cluster', pattern_shape="banned",
                   category_orders=dict(cluster=["0", "1", "2"]))
fig.show()
# TODO: edit x- and y-axis labels and title

The plot shows the number of counties in each cluster. The patterned region within each bar shows that cluster's banned counties.

Scatter Plot: Ground Truth Classification VS. K-Means Classifications¶

Ground truth classification usually refers to supervised learning, where the labels for the data are already known and are used to train a model; the model is then evaluated on how well it predicts the correct label for new data. Although ground truth labels are not typically used in unsupervised learning, which looks for structure or patterns in the data without any preconceived notions of what the data should look like, we can still use them here to check how well the discovered clusters line up with the known ban status.

On top of that, we can see where the counties with bans lie among all the other counties.

Ground Truth Classification Scatter Plot¶

In [14]:
# Use scatter plot from last progress report
fig = px.scatter(cluster_df, x="Median value of owner-occupied housing units, 2017-2021", y="Owner-occupied housing unit rate, 2017-2021",
                 color="banned", hover_data=['County Name'], symbol="banned", category_orders=dict(banned=["No", "Yes"]))
fig.show()

When looking at the ground truth plot, we see that the banned counties Seneca County, Logan County, and Auglaize County all appear close together, where the median value of owner-occupied housing units is between $100K and $150K and the owner-occupied housing unit rate is between 70 and 80 percent. This hints that these three counties may end up in a cluster with many counties that have no bans.

KMeans Classification Scatter Plot¶

In [15]:
# Use scatter plot from last progress report
fig = px.scatter(cluster_df, x="Median value of owner-occupied housing units, 2017-2021", y="Owner-occupied housing unit rate, 2017-2021",
                 color="cluster",hover_data=['banned', 'County Name'], symbol="banned", category_orders=dict(cluster=["0", "1", "2"]))
fig.show()

# TODO: use subplots

Looking at the counties named above, we can see that they have all been placed in the same cluster. This shows that our ground truth and K-Means classification scatter plots give similar insight into which counties may be clustered together.

Table¶

In [16]:
# Create a table of the clusters and average the x, y vars.
scatterPlotdf = cluster_df[['cluster', "Median value of owner-occupied housing units, 2017-2021",
                            "Owner-occupied housing unit rate, 2017-2021", "Banned or not"]]

newNames = scatterPlotdf.rename(columns={"Median value of owner-occupied housing units, 2017-2021":'Median owner-occupied housing units',
                              "Owner-occupied housing unit rate, 2017-2021":'Owner-occupied housing unit rate',
                                        "Banned or not":"Number of banned counties"})

# Group the DataFrame by the 'cluster' column and calculate the mean and sum
grouped_df = newNames.groupby('cluster').agg({'Median owner-occupied housing units': 'mean',
                                                    'Owner-occupied housing unit rate': 'mean',
                                                    'Number of banned counties': ['sum']})


# Reset the index to make the 'cluster' column a regular column
grouped_df = grouped_df.reset_index()
In [17]:
fig = go.Figure(data=[go.Table(
    header=dict(values=list(grouped_df.columns.get_level_values(0)),
                fill_color='paleturquoise',
                align='left'),
    cells=dict(values=[grouped_df.cluster, round(grouped_df['Median owner-occupied housing units'],2),
                       round(grouped_df['Owner-occupied housing unit rate'],2), grouped_df['Number of banned counties']
                      ],
               fill_color='lavender',
               align='left'))
])

fig.show()

The table shows the mean value of each scatter-plot variable for each cluster, along with the number of banned counties in each cluster.

Evaluating Cluster results¶

In [18]:
from sklearn.metrics import classification_report

y_true = cluster_df['Banned or not']
y_pred = cluster_df['cluster_id']
cluster_names = ['cluster 0', 'cluster 1', 'cluster 2']
print(classification_report(y_true, y_pred, target_names=cluster_names))
              precision    recall  f1-score   support

   cluster 0       0.89      0.76      0.82        78
   cluster 1       0.00      0.00      0.00        10
   cluster 2       0.00      0.00      0.00         0

    accuracy                           0.67        88
   macro avg       0.30      0.25      0.27        88
weighted avg       0.79      0.67      0.73        88

This report evaluates the k-means cluster assignments against the 'Banned or not' ground truth labels.

Here is a quick explanation of what each metric means and how it is used to evaluate the performance of the model:

  • Precision: the proportion of true positive predictions out of all positive predictions made by the model.
  • Recall: the proportion of actual positive instances that the model correctly identified.
  • F1-score: the harmonic mean of precision and recall, which gives a balanced measure of their performance.
  • Support: the number of instances in each cluster.
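As a sanity check, these metrics can be reproduced by hand from confusion counts; the labels below are toy values, not our county data:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical true and predicted labels for a single positive class
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(toy_true, toy_pred))  # true positives: 3
fp = sum(t == 0 and p == 1 for t, p in zip(toy_true, toy_pred))  # false positives: 1
fn = sum(t == 1 and p == 0 for t, p in zip(toy_true, toy_pred))  # false negatives: 1

precision = tp / (tp + fp)                          # 3/4 = 0.75
recall = tp / (tp + fn)                             # 3/4 = 0.75
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean = 0.75

# scikit-learn agrees with the hand computation
assert precision == precision_score(toy_true, toy_pred)
assert recall == recall_score(toy_true, toy_pred)
assert f1 == f1_score(toy_true, toy_pred)
```

The `classification_report` above computes exactly these quantities per class, treating each cluster label as its own positive class.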

For cluster 0, the precision is 0.89, indicating that 89% of the data points that the model predicted as cluster 0 were actually in cluster 0. The recall is 0.76, indicating that the model correctly identified 76% of the data points in cluster 0. The F1-score is 0.82, a harmonic mean of precision and recall, representing the model's overall performance on this cluster. For cluster 1 and cluster 2, the precision, recall, and F1-score are all 0 because the model did not predict any data points in those clusters. The model's accuracy is 0.67, which means that the model correctly classified 67% of the data points.

The model performed well on cluster 0 but failed to identify any data points in clusters 1 and 2. Nevertheless, the weighted average F1-score of 0.73 suggests that the model's overall performance is satisfactory, compared to the 67% accuracy.

Conclusion and Next Steps¶

In the next file, 'housingViz.ipynb', we will create a convex hull around the banned counties in each cluster and take a closer look to identify the variables shared by the most counties and the counties that appear most often across those variables.

In [ ]: